The code below imports the packages used in this assignment.
The code below imports the downloaded data.
# Import Data
suppressMessages({
resale_flat_prices_2017 <- read_csv(
'data/ResaleflatpricesbasedonregistrationdatefromJan2017onwards.csv'
)
})
print('loaded')
## [1] "loaded"
In 2021, 261 HDB flats were transacted at $1m or more. Compute how many such transactions there were in the last year.
Find the number of transactions at $1m or more in 2023.
This involves trimming the dataset down with a few conditional filters.
The code below extracts the list of transactions that occurred in 2023.
#Filter 2023 transactions
transactions_2023 <- resale_flat_prices_2017 %>%
filter(grepl("2023", month))
The code below filters the dataframe for transactions where resale_price >= $1m
#Extract >= $1m
transactions_2023_million <- transactions_2023 %>%
filter(resale_price >= 1000000)
The code below gets the total count of the transactions_2023_million dataframe.
#Count
count_2023_million <- nrow(transactions_2023_million)
count_2023_million
## [1] 470
There were 470 resale transactions in 2023 with prices greater than or equal to $1m.
HDB resale prices rose 0.8 per cent in December 2021 from the previous month. Do the same comparison for the same period in the last year.
Find the percentage change in resale price between December 2023 and November 2023.
The code below extracts the list of transactions for December and November 2023.
# Filter rows where 'month' equals '2023-12'
transactions_2023_12 <- transactions_2023 %>%
filter(month == "2023-12")
transactions_2023_11 <- transactions_2023 %>%
filter(month == "2023-11")
This section calculates the mean resale price for each month.
# Calculate the mean of the 'resale_price' column
mean_resale_price_12_2023 <- mean(transactions_2023_12$resale_price, na.rm = TRUE)
mean_resale_price_11_2023 <- mean(transactions_2023_11$resale_price, na.rm = TRUE)
mean_resale_price_12_2023
## [1] 578587.3
mean_resale_price_11_2023
## [1] 581146.7
This section calculates the percentage difference between both months.
# Calculate the percentage difference
percentage_difference_1a2 <- ((mean_resale_price_12_2023 - mean_resale_price_11_2023) / mean_resale_price_11_2023) * 100
print(percentage_difference_1a2)
## [1] -0.4404132
HDB resale prices fell 0.44% in December 2023 from the previous month.
HDB resale prices in December 2021 were 13.6 per cent higher than a year ago. Compute the same but for the December in the last year.
Find the percentage change in resale price between December 2023 and December 2022.
The code below extracts the list of transactions for December 2022.
# Filter rows for 2022 transactions
transactions_2022 <- resale_flat_prices_2017 %>%
filter(grepl("2022", month))
# Filter rows where 'month' equals '2022-12'
transactions_2022_12 <- transactions_2022 %>%
filter(month == "2022-12")
This section calculates the mean resale price for that month.
mean_resale_price_12_2022 <- mean(transactions_2022_12$resale_price, na.rm = TRUE)
mean_resale_price_12_2022
## [1] 549908.5
This section calculates the percentage difference between both months.
# Calculate the percentage difference
percentage_difference_1a3 <- ((mean_resale_price_12_2023 - mean_resale_price_12_2022) / mean_resale_price_12_2022) * 100
print(percentage_difference_1a3)
## [1] 5.215193
HDB resale prices in December 2023 were 5.22 per cent higher than a year ago, in December 2022.
There were 2,429 HDB flats sold in December 2021. Compute the same for December last year.
Find the number of transactions in December 2023.
This section counts the number of transactions.
# Calculate the count of rows
num_of_transactions_12_23 <- nrow(transactions_2023_12)
num_of_transactions_12_23
## [1] 2009
There were 2,009 HDB flats sold in December 2023.
Replicate the first plot in the article (“HDB resale volume”, the one with the blue columns), including the approximate style (it does not have to be exactly the same), but for the data in the last year.
Plot a graph showing the HDB resale volume from Dec 2022 to Dec 2023 in the Straits Times format.
First we convert the ‘month’ column from the original dataframe to a date format to allow for easier filtering later on.
# Convert 'month' column to date format
resale_flat_prices_2017$month <- ymd(paste0(resale_flat_prices_2017$month, "-01"))
Filter to the analysis period: obtain the list of transactions from Dec 2022 to Dec 2023.
# Extract list
transactions_2212_2312 <- resale_flat_prices_2017 %>%
filter(month >= as.Date("2022-12-01") & month <= as.Date("2023-12-01"))
This section calculates the number of transactions for each month.
# Compute the number of transactions for each month
transactions_per_month <- transactions_2212_2312 %>%
group_by(month) %>%
summarise(number_of_transactions = n(), .groups = "drop")
head(transactions_per_month)
This section imports custom fonts and sets specific colors to match the desired graph.
suppressMessages({
# Create a color vector with a default blue color, and a darker shade for the first and last bars
color_vector <- rep("#c4ddf3", nrow(transactions_per_month))
color_vector[1] <- "#599ddc" # First bar
color_vector[nrow(transactions_per_month)] <- "#599ddc" # Last bar
# Specify the path to the directory
font_path <- "C:/Users/colin/OneDrive/Desktop/data science mod/working/fonts"
# Import the fonts
font_import(paths = font_path, prompt = FALSE)
# Load fonts
loadfonts(device = "win") # Use "pdf" for PDF output on non-Windows systems
# List the fonts
fonts()
})
## [1] "EconSansCndBol" "EconSansCndBolIta" "EconSansCndLig"
## [4] "EconSansCndLigIta" "EconSansCndMed" "EconSansCndMedIta"
## [7] "EconSansCndReg" "EconSansCndRegIta"
# Plot the data
p <- ggplot(transactions_per_month, aes(x = month, y = number_of_transactions)) +
geom_col(color = "white", fill = color_vector) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0, margin = margin(b = 10), family = "EconSansCndBol", colour = "#666666"),
plot.subtitle = element_text(hjust = 0, family = "EconSansCndReg", colour = "#666666"),
axis.text.x = element_text(vjust = 1, hjust = 1, family = "EconSansCndReg", colour = "#666666"),
axis.text.y = element_text(family = "EconSansCndLig", colour = "#666666"), # Lighter grey color for Y axis text
legend.position = "none",
panel.grid.major.x = element_blank(), # Ensure no vertical grid lines
panel.grid.minor.x = element_blank(), # Ensure no minor vertical grid lines
panel.grid.major.y = element_line(color = "#d3d3d3"),
panel.background = element_blank(), # Clean background
panel.grid.minor.y = element_blank() # Remove minor horizontal grid lines
) +
labs(
title = "HDB Resale Volume",
subtitle = "No. of transactions",
x = "",
y = ""
) +
scale_y_continuous(
labels = scales::comma,
breaks = seq(0, 3000, by = 500),
limits = c(NA, 3000) # Increase the y-axis limit to 3000
) +
scale_x_date(
date_labels = "%b",
date_breaks = "1 month",
) +
geom_text(
data = subset(transactions_per_month, format(month, "%m") == "12"),
aes(label = scales::comma(number_of_transactions)),
vjust = -0.3,
fontface = "bold",
colour = "#666666"
)
# Render
p
Before starting on the analysis, we scope out the dataset to understand its structure and feel. We also conduct a basic check for null values.
Display the first few rows to get a general feel of the data.
# View top 5 rows
head(resale_flat_prices_2017)
The dataset is arranged in ascending chronological and alphabetical order, starting from January 2017 and the town Ang Mo Kio.
Next we check the last 5 entries of the dataset to see how up to date it is.
# View bottom 5 rows
tail(resale_flat_prices_2017,5)
The latest entry is from February 2024, and the current date is 23 February 2024. A quick check on the beta.data.gov.sg website shows the dataset was last updated 8 hours ago, so it can be assumed to be current up to this week.
We obtain the shape of the dataset to understand the size of the dataset we will be working with.
# Shape of dataset
dim(resale_flat_prices_2017)
## [1] 173334 11
The dataset has 173,334 rows and 11 columns, a fairly large dataset to work with.
Next we print all column names to understand the variables we will be working with.
# Print column names
names(resale_flat_prices_2017)
## [1] "month" "town" "flat_type"
## [4] "block" "street_name" "storey_range"
## [7] "floor_area_sqm" "flat_model" "lease_commence_date"
## [10] "remaining_lease" "resale_price"
The dataset can be broken down into 5 main categories namely: time, location, flat details, lease and price.
Next we print a summary of the dataset to understand the datatypes of each column.
# Print summary
summary(resale_flat_prices_2017)
## month town flat_type block
## Min. :2017-01-01 Length:173334 Length:173334 Length:173334
## 1st Qu.:2019-01-01 Class :character Class :character Class :character
## Median :2020-12-01 Mode :character Mode :character Mode :character
## Mean :2020-10-01
## 3rd Qu.:2022-07-01
## Max. :2024-02-01
## street_name storey_range floor_area_sqm flat_model
## Length:173334 Length:173334 Min. : 31.00 Length:173334
## Class :character Class :character 1st Qu.: 82.00 Class :character
## Mode :character Mode :character Median : 93.00 Mode :character
## Mean : 97.24
## 3rd Qu.:112.00
## Max. :249.00
## lease_commence_date remaining_lease resale_price
## Min. :1966 Length:173334 Min. : 140000
## 1st Qu.:1985 Class :character 1st Qu.: 368000
## Median :1996 Mode :character Median : 463000
## Mean :1996 Mean : 493357
## 3rd Qu.:2009 3rd Qu.: 587000
## Max. :2022 Max. :1568888
The dataset has two data types, strings and numeric values. However, some columns will need transformation later on to convert string values to numeric (e.g. remaining_lease).
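As an illustration of the transformation flagged above, here is a minimal base-R sketch; it assumes remaining_lease strings look like "61 years 04 months" or "61 years", and the helper name is my own, not from the assignment.

```r
# Hypothetical helper: convert a remaining_lease string to total months,
# assuming the "NN years" or "NN years MM months" format.
lease_to_months <- function(x) {
  yrs <- as.numeric(sub(" years?.*$", "", x))          # leading year count
  has_m <- grepl("month", x)
  mth <- rep(0, length(x))                             # default: no months part
  mth[has_m] <- as.numeric(sub(" months?$", "",
                               sub("^.*years? ", "", x[has_m])))
  yrs * 12 + mth
}

lease_to_months(c("61 years 04 months", "75 years"))
```

On the real data this could be applied as `resale_flat_prices_2017$remaining_lease_months <- lease_to_months(resale_flat_prices_2017$remaining_lease)`.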
The code below checks the dataset for null values.
# null value check
na_check <- colSums(is.na(resale_flat_prices_2017)) > 0
print(na_check)
## month town flat_type block
## FALSE FALSE FALSE FALSE
## street_name storey_range floor_area_sqm flat_model
## FALSE FALSE FALSE FALSE
## lease_commence_date remaining_lease resale_price
## FALSE FALSE FALSE
The dataset does not appear to contain any null values. However, we cannot assume the values are error-free: missing values may instead have been encoded as sentinels such as 0 or 9999.
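A small sketch of such a sentinel check follows; the sentinel values 0 and 9999 and the toy dataframe are illustrative assumptions, not findings from the data.

```r
# Hypothetical sanity check: count how often suspicious sentinel values
# (here assumed to be 0 or 9999) appear in selected numeric columns.
check_sentinels <- function(df, cols, sentinels = c(0, 9999)) {
  sapply(df[cols], function(v) sum(v %in% sentinels))
}

# Toy example; on the real data one would pass resale_flat_prices_2017
# with cols = c("floor_area_sqm", "resale_price").
toy <- data.frame(floor_area_sqm = c(93, 0, 112),
                  resale_price   = c(463000, 9999, 587000))
check_sentinels(toy, c("floor_area_sqm", "resale_price"))
```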
The section below lists an overview of the research questions asked, the insights found, and the corresponding type of visualization used.
Is there a relationship between floor area and resale price?
Boxplot and Violin
From the boxplot, it can be observed that outliers exist within each category except the 200-250 sqm range, indicating that while floor area affects the resale price, other stronger factors, possibly location, prompt buyers to pay more for the same floor area.
Furthermore, in the violin plot, the increasingly elongated shape as the floor area category increases suggests that the variance of resale prices buyers are willing to pay grows with floor area.
Does the storey range of a flat have an impact on its resale price?
Bar graph for each flat type and storey range.
Multi-generation and one-room flats have a low storey range of only up to 12 floors. You would expect resale price to increase with storey range; however, beyond a certain point, usually around 90% of the maximum storey range for that flat type, resale prices fall, as seen in the 2-room, 3-room, 5-room and executive flats.
Is there a relationship between location and resale price for a given flat type?
Heatmap.
Darker colors indicate higher average resale prices; lighter colors represent lower average resale prices, indicating more affordable options. From the heatmap, it can be deduced that location does matter for the resale price of a given flat type, as seen in how the Central Area and Queenstown outperform their peers for the same flat type.
To investigate the first question, we start with a scatter plot and regression line to see whether there is a trend relating floor area and resale price across flat types.
# Plot
ggplot(data = resale_flat_prices_2017, aes(x = floor_area_sqm, y = resale_price)) +
geom_point(aes(color = flat_type), alpha = 0.6) + # Points with updated color scheme
geom_smooth(method = "lm", se = FALSE, color = "black") + # Add regression line
labs(x = "Floor Area (sqm)", y = "Resale Price", title = "Relationship Between Floor Area and Resale Price") +
theme_minimal() +
scale_color_brewer(palette = "Set3")
## `geom_smooth()` using formula = 'y ~ x'
There seems to be a positive correlation between floor area and resale price; however, the distinction between the flat types is not clear due to the large number of data points. It may help to segment the floor area for better readability.
We first summarise the floor area column to understand its median, minimum and maximum.
# Summary of 'floor_area_sqm' column
summary(resale_flat_prices_2017$floor_area_sqm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 31.00 82.00 93.00 97.24 112.00 249.00
Next we plot a boxplot to visualise the distribution of the floor area.
# Boxplot of 'floor_area_sqm' column
ggplot(resale_flat_prices_2017, aes(y = floor_area_sqm)) +
geom_boxplot(fill = "#b3b3b3", color = "#404040", outlier.color = "#606060") +
coord_flip() +
theme_minimal() +
labs(title = "Boxplot of Floor Area (sqm)",
y = "Floor Area (sqm)") +
theme(plot.title = element_text(hjust = 0.5),
axis.title.y = element_blank(),
axis.text = element_text(color = "#333333"),
axis.title = element_text(color = "#333333"))
With the minimum floor area at 31 and the maximum at 249, with majority of the values between 82-112, it might be good to segment the floor areas into breaks of 50 to make the visualization cleaner.
This section splits the floor area into intervals of 50.
resale_flat_prices_2017$floor_area_interval <- cut(resale_flat_prices_2017$floor_area_sqm,
breaks = c(0, 50, 100, 150, 200, 250),
include.lowest = TRUE,
right = FALSE)
# Boxplot
ggplot(data = resale_flat_prices_2017, aes(x = floor_area_interval, y = resale_price, fill = floor_area_interval)) +
geom_boxplot() +
scale_fill_brewer(palette = "Set3") +
labs(x = "Floor Area Interval (sqm)", y = "Resale Price", title = "Resale Price by Floor Area Interval") +
theme(axis.text.x = element_text(hjust = 0.5),
plot.title = element_text(hjust = 0.5))
# Violin Plot
ggplot(data = resale_flat_prices_2017, aes(x = floor_area_interval, y = resale_price, fill = floor_area_interval)) +
geom_violin(trim = FALSE) +
scale_fill_brewer(palette = "Set3") +
labs(x = "Floor Area Interval (sqm)", y = "Resale Price", title = "Resale Price Distribution by Floor Area Interval") +
theme(axis.text.x = element_text(hjust = 0.5),
plot.title = element_text(hjust = 0.5))
From the boxplot, it can be observed that outliers exist within each category except the 200-250 sqm range, indicating that while floor area affects the resale price, other stronger factors, possibly location, prompt buyers to pay more for the same floor area.
Furthermore, in the violin plot, the increasingly elongated shape as the floor area category increases suggests that the variance of resale prices buyers are willing to pay grows with floor area.
This section checks for the unique values of the ‘storey_range’ column.
# Unique Values
unique(resale_flat_prices_2017$storey_range)
## [1] "10 TO 12" "01 TO 03" "04 TO 06" "07 TO 09" "13 TO 15" "19 TO 21"
## [7] "22 TO 24" "16 TO 18" "34 TO 36" "28 TO 30" "37 TO 39" "49 TO 51"
## [13] "25 TO 27" "40 TO 42" "31 TO 33" "46 TO 48" "43 TO 45"
The storey ranges of the flats go up to 51 storeys, in bands of 3 storeys each.
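Because storey_range is a character column, the bands would otherwise sort alphabetically on a plot axis. A small sketch (assuming every band has the "NN TO NN" form shown above; the helper name is illustrative) converts each band to its numeric midpoint for proper ordering:

```r
# Assumed format "NN TO NN": take the midpoint of the band for numeric sorting.
storey_mid <- function(x) {
  lo <- as.numeric(sub(" TO .*$", "", x))   # lower bound of the band
  hi <- as.numeric(sub("^.* TO ", "", x))   # upper bound of the band
  (lo + hi) / 2
}

storey_mid(c("01 TO 03", "49 TO 51"))
```

One could then reorder the factor, e.g. `reorder(storey_range, storey_mid(storey_range))`, before plotting.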
This section calculates the average price for each storey range category.
# Average Price
average_prices_storey_range <- resale_flat_prices_2017 %>%
group_by(storey_range) %>%
summarise(average_price = mean(resale_price, na.rm = TRUE))
average_prices_storey_range
ggplot(average_prices_storey_range, aes(x = storey_range, y = average_price, fill = average_price)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Storey Range", y = "Average Resale Price", title = "Average Resale Price by Storey Range") +
scale_fill_gradient(low = "lightblue", high = "darkblue") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1, size = 10)) +
theme(axis.title = element_text(size = 12, face = "bold"),
plot.title = element_text(size = 14, face = "bold"),
legend.position = "none")
It seems there is a clear trend of resale price increasing for the upper floors. We next bring in flat type to further sieve out more insights.
This section additionally groups the data by flat type.
average_prices_storey_flat <- resale_flat_prices_2017 %>%
group_by(storey_range, flat_type) %>%
summarise(average_price = mean(resale_price, na.rm = TRUE), .groups = "drop")
average_prices_storey_flat
ggplot(average_prices_storey_flat, aes(x = storey_range, y = average_price, fill = flat_type)) +
geom_bar(stat = "identity", position = "dodge", color = "black") +
scale_fill_brewer(palette = "Set3") +
labs(x = "Storey Range", y = "Average Resale Price", title = "Average Resale Price by Storey Range and Flat Type") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1),
axis.title = element_text(size = 12, face = "bold"),
plot.title = element_text(size = 14, face = "bold"),
legend.position = "bottom",
legend.title = element_blank())
Using a grouped bar chart makes it hard to compare insights across flat types. We next use a facet wrap to plot a separate panel for each flat type.
average_prices_storey_flat$storey_range_numeric <- as.numeric(factor(average_prices_storey_flat$storey_range))
ggplot(average_prices_storey_flat, aes(x = storey_range, y = average_price, fill = storey_range_numeric)) +
geom_bar(stat = "identity", position = "dodge", color = "black") +
scale_fill_gradient(low = "lightblue", high = "darkblue") +
labs(x = "Storey Range", y = "Average Resale Price", title = "Average Resale Price by Storey Range, Faceted by Flat Type") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
axis.title = element_text(size = 12, face = "bold"),
plot.title = element_text(size = 14, face = "bold")) +
facet_wrap(~ flat_type, scales = "free_y", ncol = 1) +
guides(fill = "none") +
scale_y_continuous(labels = scales::number_format())
Multigeneration and one room flats have a low storey range of only up to 12 floors. You would expect that as the storey range increases, the resale price increases, however after a certain floor range, usually the 90% of the storey range of that flat type, the resale prices falls, as seen from the 2room, 3room, 5room and execuetive flats.
This section aggregates the data from the original dataset.
average_resale_prices_town_flat_type <- resale_flat_prices_2017 %>%
group_by(town, flat_type) %>%
summarise(average_price = mean(resale_price, na.rm = TRUE), .groups = "drop")
average_resale_prices_town_flat_type
This section reshapes the data for a heatmap.
heatmap_data <- dcast(average_resale_prices_town_flat_type, town ~ flat_type, value.var = "average_price")
heatmap_data
# Melt the data back to a long format for ggplot2
heatmap_data_long <- melt(heatmap_data, id.vars = "town", variable.name = "flat_type", value.name = "average_price")
#Plot
ggplot(heatmap_data_long, aes(x = flat_type, y = town, fill = average_price)) +
geom_tile(color = "white", linewidth = 0.1) + # Add border to each tile for better separation
scale_fill_gradientn(colors = RColorBrewer::brewer.pal(9, "Blues"),
labels = label_number(big.mark = ",", decimal.mark = ".", accuracy = 1)) +
labs(x = "Flat Type", y = "Town", fill = "Average Price", title = "Average Resale Prices by Town and Flat Type") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1), # Adjust for better readability
axis.text.y = element_text(size = 8), # Adjust size for better readability if needed
plot.title = element_text(hjust = 0.5, size = 14), # Center and size the title
legend.title = element_text(size = 10), # Adjust legend title size
legend.text = element_text(size = 8)) # Adjust legend text size
Darker colors indicate higher average resale prices; lighter colors represent lower average resale prices, indicating more affordable options. From the heatmap, it can be deduced that location does matter for the resale price of a given flat type, as seen in how the Central Area and Queenstown outperform their peers for the same flat type.
Recreate the ridgeline plot showing the distribution of HDB resale prices of 4-room units by neighbourhood in price per square metre, in 2022.
# Calculate price per square meter
resale_flat_prices_2017$price_per_sqm <- resale_flat_prices_2017$resale_price / resale_flat_prices_2017$floor_area_sqm
# Keep only 4-room transactions in 2022.
resale_2022_4rm <- resale_flat_prices_2017 %>%
filter(flat_type == "4 ROOM" & year(month) == 2022)
# Check
resale_2022_4rm
resale_2022_4rm$price_per_sqm <- as.numeric(as.character(resale_2022_4rm$price_per_sqm))
ggplot(resale_2022_4rm, aes(x = price_per_sqm, y = reorder(town, -price_per_sqm, FUN = median))) +
geom_density_ridges_gradient(
aes(fill = after_stat(x)),
scale = 3, rel_min_height = 0.01, gradient_lwd = 1
) +
scale_fill_viridis_c(option = "C") +
labs(title = "HDB resale prices (4 room) in 2022 by neighbourhood",
subtitle = "Neighbourhoods exhibit large differences",
x = "Price per square metre ($)",
y = "") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
## Picking joint bandwidth of 304
ggplot(resale_2022_4rm, aes(x = price_per_sqm, y = reorder(town, -price_per_sqm, FUN = median))) +
geom_density_ridges(alpha=0.6, stat="binline", bins=20) +
scale_fill_viridis_c(option = "C", guide = "none") + # Hide the legend
labs(title = "HDB resale prices (4 room) in 2022 by neighbourhood",
subtitle = "Neighbourhoods exhibit large differences",
x = "Price per square metre ($)",
y = "") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
We tried another variation of the ridgeline plot (binned histograms), but it does not read as well as the gradient density ridges.
Most of the prices per square metre fall within the $5,000 range, as indicated by the majority of the plot being purple. Comparing the heights of the ridges across towns, towns further from the city centre (e.g. Woodlands, Pasir Ris) have more transactions than areas nearer the city centre such as the Central Area and Queenstown. Furthermore, the sharp, single peaks of these outlying towns show low variation in the distribution of prices. This differs from areas nearer the city centre, which show multiple peaks or a more gradual spread of the ridges across the price range.
Create a visual comparing prices between different years.
Comparison of HDB resale prices (4 room) between 2018 and 2023 by neighbourhood.
# Sort for only 4-room transactions from 2018 to 2023.
resale_2018_2023_4rm <- resale_flat_prices_2017 %>%
filter(flat_type == "4 ROOM" & year(month) >= 2018 & year(month) <= 2023)
# Check
head(resale_2018_2023_4rm)
# Calculate median prices for 2018 and 2023
median_prices <- resale_2018_2023_4rm %>%
group_by(town, year = year(month)) %>%
filter(year == 2018 | year == 2023) %>%
summarise(median_price_per_sqm = median(price_per_sqm), .groups = 'drop') %>%
spread(key = year, value = median_price_per_sqm)
# Check
head(median_prices)
median_prices_long <- median_prices %>%
pivot_longer(cols = c(`2018`, `2023`), names_to = "Year", values_to = "Price")
# Plot
ggplot(median_prices_long, aes(x = Price, y = reorder(town, Price), color = Year)) +
geom_segment(data = median_prices, aes(x = `2018`, xend = `2023`, y = town, yend = town), color = "grey") +
geom_point(size = 3) + # Place the geom_point layer after geom_segment
scale_color_manual(values = c(`2018` = "lightblue", `2023` = "darkblue")) +
scale_x_continuous(name = "Price per Square Metre ($)", labels = scales::dollar) +
labs(title = "Comparison of HDB resale prices (4 room) between 2018 and 2023",
subtitle = "Analysis by neighbourhood") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
guides(color = guide_legend(title = "Year"))
For the freeform task, I have decided to look at a chess dataset to unearth patterns and relationships, since chess is a hobby of mine. The dataset can be downloaded from https://www.kaggle.com/datasets/datasnaek/chess.
3.1 - What are the favourite chess openings for the different player classes?
Stacked Barchart
It can be concluded that the Sicilian Defence is highly popular across all player classes. Thus, taking a cue from the advanced players, you might want to learn these three openings, the Sicilian Defence, the French Defence and the Caro-Kann Defence, for a higher probability of winning.
3.2 - Does white or black really give you an advantage in winning?
Barchart
Based on the analysis of game outcomes alone, it can be deduced that playing as White confers a small advantage in the probability of winning, likely due to the first-move advantage.
3.3 - Which chess opening should White use?
Barchart
Therefore, to get the highest probability of winning, I will strive to play as White and use the Sicilian Defence.
The code below imports the downloaded data.
# Import Data
suppressMessages({
chess <- read_csv(
'data/games_metadata_profile_2024_01.csv'
)
})
print('loaded')
## [1] "loaded"
We first begin with an exploratory data analysis to scope the dataset.
head(chess)
dim(chess)
## [1] 130922 33
Dataset is quite large at 130,922 entries.
names(chess)
## [1] "GameID" "Event" "Round"
## [4] "Site" "Date" "Time"
## [7] "White" "WhiteElo" "WhiteRatingDiff"
## [10] "White_is_deleted" "White_tosViolation" "White_profile_flag"
## [13] "White_createdAt" "White_playTime_total" "White_count_all"
## [16] "White_title" "Black" "BlackElo"
## [19] "BlackRatingDiff" "Black_is_deleted" "Black_tosViolation"
## [22] "Black_profile_flag" "Black_createdAt" "Black_playTime_total"
## [25] "Black_count_all" "Black_title" "Moves"
## [28] "TotalMoves" "ECO" "Opening"
## [31] "TimeControl" "Termination" "Result"
33 possible features to play with.
# Number of Chess Openings
length(unique(chess$ECO))
## [1] 465
Quite a large number of different chess openings are used, making this an interesting variable to work with. For reference, each chess opening is ascribed to an ECO code.
# Boxplot
boxplot(chess$TotalMoves,
horizontal = TRUE,
main = "Total Moves in Chess Games",
xlab = "Total Moves")
Most games run about 50-75 moves, with the median at about 60 moves. There are quite a number of outliers beyond roughly 140 moves.
chess$Moves[1]
## [1] "1. d4 { [%eval 0.13] [%clk 0:05:00] } 1... d5 { [%eval 0.27] [%clk 0:05:00] } 2. c4 { [%eval 0.0] [%clk 0:05:02] } 2... Nf6?! { [%eval 0.71] [%clk 0:04:58] } 3. cxd5?! { [%eval 0.4] [%clk 0:05:02] } 3... Qxd5?! { [%eval 0.83] [%clk 0:04:58] } 4. Nc3?! { [%eval 0.5] [%clk 0:05:02] } 4... Qa5?! { [%eval 0.91] [%clk 0:04:55] } 5. Bd2?! { [%eval 0.95] [%clk 0:05:02] } 5... c6?! { [%eval 1.54] [%clk 0:04:56] } 6. e4?! { [%eval 1.37] [%clk 0:05:01] } 6... Qb4? { [%eval 3.79] [%clk 0:04:57] } 7. Nf3? { [%eval 1.99] [%clk 0:04:37] } 7... Qxb2?? { [%eval 5.21] [%clk 0:04:55] } 8. Bc4?? { [%eval 2.18] [%clk 0:04:22] } 8... Qb6?? { [%eval 1.99] [%clk 0:04:46] } 9. O-O?? { [%eval 2.32] [%clk 0:04:12] } 9... Bg4?? { [%eval 2.82] [%clk 0:04:44] } 10. Be2? { [%eval 1.3] [%clk 0:03:28] } 10... Bxf3? { [%eval 1.21] [%clk 0:04:42] } 11. Bxf3? { [%eval 1.4] [%clk 0:03:29] } 11... Qxd4? { [%eval 1.44] [%clk 0:04:42] } 12. Ne2? { [%eval -0.64] [%clk 0:03:24] } 12... Qd8? { [%eval 1.61] [%clk 0:04:41] } 13. Qc2? { [%eval -1.26] [%clk 0:03:16] } 13... Nbd7? { [%eval -1.48] [%clk 0:04:37] } 14. Rfd1? { [%eval -1.25] [%clk 0:03:16] } 14... e6? { [%eval -0.89] [%clk 0:04:36] } 15. Ng3?! { [%eval -1.85] [%clk 0:02:32] } 15... Qc7?! { [%eval -1.78] [%clk 0:04:36] } 16. a4?! { [%eval -2.1] [%clk 0:02:10] } 16... h5?! { [%eval -1.58] [%clk 0:04:30] } 17. Ne2?! { [%eval -1.92] [%clk 0:02:02] } 17... Ne5?! { [%eval -1.9] [%clk 0:04:30] } 18. Bf4?! { [%eval -2.19] [%clk 0:01:59] } 18... Nxf3+?! { [%eval -1.49] [%clk 0:04:28] } 19. gxf3?! { [%eval -1.46] [%clk 0:02:00] } 19... e5?! { [%eval -1.41] [%clk 0:03:56] } 20. Be3?! { [%eval -1.43] [%clk 0:01:57] } 20... Be7?! { [%eval -1.32] [%clk 0:03:29] } 21. Ng3?! { [%eval -1.72] [%clk 0:01:56] } 21... g6?! { [%eval -1.78] [%clk 0:03:21] } 22. a5?! { [%eval -1.82] [%clk 0:01:37] } 22... Qc8?! { [%eval -1.0] [%clk 0:03:18] } 23. Bc5?! { [%eval -1.69] [%clk 0:01:05] } 23... Bxc5?! { [%eval -1.45] [%clk 0:02:45] } 24. Qxc5?! 
{ [%eval -1.56] [%clk 0:01:05] } 24... Qc7?! { [%eval -1.4] [%clk 0:02:26] } 25. Rd6?! { [%eval -2.24] [%clk 0:00:55] } 25... Qe7? { [%eval -0.1] [%clk 0:02:12] } 26. Rad1? { [%eval -1.12] [%clk 0:00:40] } 26... O-O?! { [%eval -0.45] [%clk 0:02:06] } 27. h3? { [%eval -2.14] [%clk 0:00:07] } 27... Rad8? { [%eval -2.22] [%clk 0:02:03] } 0-1"
Each cell in the 'Moves' column contains the full list of moves played by both players, annotated with engine evaluations and clock times. This looks to be the most interesting column to work with, although it requires some pre-processing.
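As a sketch of that pre-processing (assuming the annotation format shown above, with { [%eval ...] [%clk ...] } blocks and numbered moves), the following hypothetical helper strips the annotations and move numbers to leave a plain vector of moves:

```r
# Hypothetical cleaner for the Moves column, assuming the PGN-style
# format shown above.
clean_moves <- function(pgn) {
  no_braces <- gsub("\\{[^}]*\\}", "", pgn)             # drop { ... } annotation blocks
  no_nums   <- gsub("\\d+\\.(\\.\\.)?", "", no_braces)  # drop "1." and "1..." markers
  strsplit(trimws(gsub("\\s+", " ", no_nums)), " ")[[1]]
}

clean_moves("1. d4 { [%eval 0.13] } 1... d5 { [%eval 0.27] } 0-1")
```

Note that result markers like "0-1" and annotation glyphs such as "?!" survive this pass and would need their own handling in a fuller pipeline.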
# Get unique values from the 'White' and 'Black' columns
unique_white <- unique(chess$White)
unique_black <- unique(chess$Black)
# Remove values present in 'White' from unique values in 'Black'
unique_black <- unique_black[!unique_black %in% unique_white]
# Sum up the lengths of unique values from both columns
total_unique <- length(unique_white) + length(unique_black)
total_unique
## [1] 202566
The number of unique players in this dataset is quite high at 202,566 players. Doing an analysis on the interactions between these players might yield something interesting.
# Plot boxplot for WhiteElo
ggplot(chess, aes(y = WhiteElo)) +
geom_boxplot() +
coord_flip() +
labs(y = "White Elo", title = "Boxplot of White Elo Ratings")
summary(chess$WhiteElo)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 400 1318 1610 1618 1907 3233
# Plot boxplot for BlackElo
ggplot(chess, aes(y = BlackElo)) +
geom_boxplot() +
coord_flip() +
labs(y = "Black Elo", title = "Boxplot of Black Elo Ratings")
summary(chess$BlackElo)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 400 1317 1613 1618 1907 3198
The majority of players fall within the Elo range of 1300-1907. With reference to the chess rating system (https://en.wikipedia.org/wiki/Chess_rating_system), most players are within Class D to Class A. However, the boxplot shows many outliers at both ends of the spectrum, presumably new players and very experienced players, with the highest rating at 3233, which corresponds to 'super grandmaster' level.
Thus, separating the dataset into 3 groups of players, beginners, amateurs and advanced players, might reveal insights into differences in their decision making.
# Extract rows
chess_beginner <- chess[chess$WhiteElo < 1318 & chess$BlackElo < 1317, ]
# Number of Games
nrow(chess_beginner)
## [1] 29543
# Extract rows
chess_amateur <- chess[(chess$WhiteElo > 1318 & chess$WhiteElo < 1907) | (chess$BlackElo > 1317 & chess$BlackElo < 1907), ]
# Number of Games
nrow(chess_amateur)
## [1] 72314
# Pros: either player above the third quartile
chess_pro <- chess[chess$WhiteElo > 1907 | chess$BlackElo > 1907, ]
# Number of Games
nrow(chess_pro)
## [1] 36848
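Note that the quartile filters above leave boundary ratings (exactly 1317, 1318 or 1907) unclassified, and the OR in the amateur filter means a game can fall into both the amateur and pro groups, so the three counts overlap. One non-overlapping alternative, sketched here on toy data, classifies each game by the mean of the two players' Elo ratings with `cut()`:

```r
# Toy ratings standing in for `chess`
toy <- data.frame(WhiteElo = c(1200, 1600, 2100),
                  BlackElo = c(1250, 1700, 2000))

# Classify each game by the mean rating of the two players;
# cut() assigns every game to exactly one class, reusing the
# quartile cut-offs from the summaries above
avg_elo <- (toy$WhiteElo + toy$BlackElo) / 2
toy$Class <- cut(avg_elo,
                 breaks = c(-Inf, 1318, 1907, Inf),
                 labels = c("Beginner", "Amateur", "Pro"))
toy$Class
```

With this scheme the three group sizes sum exactly to the total number of games.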
# Get row counts of the dataframes
row_counts <- c(nrow(chess_beginner), nrow(chess_amateur), nrow(chess_pro))
# Names of the dataframes
df_names <- c("Beginner", "Amateur", "Pro")
# Create bar plot
barplot(row_counts, names.arg = df_names, xlab = "Player Class", ylab = "Number of Games", main = "Game Count by Player Class", col = "skyblue")
#### Observation
The original dataframe has been split into 3 categories of players, with the bulk of the games involving amateur players.
# Define a function to count unique openings for a dataframe
count_unique_openings <- function(df) {
df %>%
group_by(ECO) %>%
summarise(Count = n())
}
# Apply the function to each dataframe
chess_beginner_counts <- count_unique_openings(chess_beginner)
chess_amateur_counts <- count_unique_openings(chess_amateur)
chess_pro_counts <- count_unique_openings(chess_pro)
# Number of Unique Openings for each Player Class
nrow(chess_beginner_counts)
## [1] 257
nrow(chess_amateur_counts)
## [1] 400
nrow(chess_pro_counts)
## [1] 459
chess_pro_counts
In this section we map the opening names to the ECO IDs. The ECO ranges are referenced from: https://www.365chess.com/eco.php
# Define ECO ranges and corresponding opening names
eco_ranges <- data.frame(
Start = c("A00", "A01", "A02", "A04", "A10", "A40", "A42", "A43", "A45", "A47", "A48", "A50", "A51", "A53", "A56", "A57", "A60", "A80", "B00", "B01", "B02", "B06", "B07", "B10", "B20", "C00", "C20", "C21", "C23", "C25", "C30", "C40", "C41", "C42", "C44", "C45", "C46", "C47", "C50", "C51", "C53", "C55", "C60", "D00", "D01", "D02", "D03", "D04", "D06", "D07", "D10", "D16", "D17", "D20", "D30", "D43", "D50", "D70", "D80", "E00", "E01", "E10", "E11", "E12", "E20", "E60"),
End = c("A00", "A01", "A03", "A09", "A39", "A41", "A42", "A44", "A46", "A47", "A49", "A50", "A52", "A55", "A56", "A59", "A79", "A99", "B00", "B01", "B05", "B06", "B09", "B19", "B99", "C19", "C20", "C22", "C24", "C29", "C39", "C40", "C41", "C43", "C44", "C45", "C46", "C49", "C50", "C52", "C54", "C59", "C99", "D00", "D01", "D02", "D03", "D05", "D06", "D09", "D15", "D16", "D19", "D29", "D42", "D49", "D69", "D79", "D99", "E00", "E09", "E10", "E11", "E19", "E59", "E99"),
Opening = c("Polish Opening", "Nimzovich-Larsen attack", "Bird's opening", "Reti Opening", "English Opening", "Queen's Pawn", "Modern defence, Averbakh system", "Old Benoni defence", "Queen's Pawn Game", "Queen's Indian defence", "King's Indian, East Indian defence", "Queen's pawn game", "Budapest defence", "Old Indian defence", "Benoni defence", "Benko gambit", "Benoni defence", "Dutch", "King's pawn opening", "Scandinavian Defence", "Alekhine's defence", "Robatsch Defence", "Pirc defence", "Caro-Kann defence", "Sicilian Defence", "French Defence", "King's pawn game", "Centre Game", "Bishop's opening", "Vienna game", "King's gambit", "King's knight opening", "Philidor's defence", "Petrov's defence", "King's pawn game", "Scotch game", "Three knights game", "Four knights, Scotch variation", "Italian Game", "Evans gambit", "Giuoco Piano", "Two knights defence", "Ruy Lopez (Spanish opening)", "Queen's pawn game", "Richter-Veresov attack", "Queen's pawn game", "Torre attack (Tartakower variation)", "Queen's Pawn Game", "Queen's Gambit", "Queen's Gambit Declined, Chigorin defence", "Queen's Gambit Declined Slav defence", "Queen's Gambit Declined Slav accepted, Alapin variation", "Queen's Gambit Declined Slav, Czech defence", "Queen's gambit accepted", "Queen's gambit declined", "Queen's Gambit Declined semi-Slav", "Queen's Gambit Declined", "Neo-Gruenfeld defence", "Gruenfeld defence", "Queen's Pawn Game", "Catalan, closed", "Queen's Pawn Game", "Bogo-Indian defence", "Queen's Indian defence", "Nimzo-Indian defence", "King's Indian defence")
)
# Function to map ECO codes to opening names
map_eco_to_opening <- function(df) {
df <- df %>%
mutate(ECO_Start = str_sub(ECO, 1, 3)) %>%
rowwise() %>%
mutate(Opening = list(eco_ranges %>%
filter(ECO_Start >= Start & ECO_Start <= End) %>%
select(Opening) %>%
pull(Opening))) %>%
unnest(Opening) %>%
select(-ECO_Start)
df
}
# Apply the function to your dataframes
chess_beginner_counts <- map_eco_to_opening(chess_beginner_counts)
chess_amateur_counts <- map_eco_to_opening(chess_amateur_counts)
chess_pro_counts <- map_eco_to_opening(chess_pro_counts)
aggregate_counts <- function(df) {
df %>%
group_by(Opening) %>%
summarise(TotalCount = sum(Count)) %>%
arrange(desc(TotalCount))
}
chess_beginner_agg <- aggregate_counts(chess_beginner_counts)
chess_amateur_agg <- aggregate_counts(chess_amateur_counts)
chess_pro_agg <- aggregate_counts(chess_pro_counts)
chess_pro_agg
ggplot(chess_amateur_agg, aes(x = reorder(Opening, TotalCount), y = TotalCount)) +
geom_bar(stat = "identity") +
coord_flip()
There are too many openings to compare at once; the distribution is too wide to read.
# Select top 10 openings based on TotalCount for each class
top_beginner <- chess_beginner_agg %>%
arrange(desc(TotalCount)) %>%
top_n(10, TotalCount) %>%
mutate(Class = 'Beginner')
top_amateur <- chess_amateur_agg %>%
arrange(desc(TotalCount)) %>%
top_n(10, TotalCount) %>%
mutate(Class = 'Amateur')
top_pro <- chess_pro_agg %>%
arrange(desc(TotalCount)) %>%
top_n(10, TotalCount) %>%
mutate(Class = 'Pro')
# Combine into a single dataframe
combined_top_openings <- bind_rows(top_beginner, top_amateur, top_pro)
# Fill in missing combinations with zero counts
combined_top_openings <- combined_top_openings %>%
complete(Opening, Class, fill = list(TotalCount = 0)) %>%
arrange(Opening, Class)
# Plot
ggplot(combined_top_openings, aes(x = Opening, y = TotalCount, fill = Class)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.7)) +
coord_flip() +
theme(axis.text.y = element_text(angle = 0, hjust = 1)) +
labs(title = "Top 10 Chess Openings by Player Class",
x = "Chess Opening",
y = "Total Count") +
scale_fill_brewer(palette = "Set1") +
theme_minimal()
###### Observation
The counts need to be normalised, since the total number of games played per class differs.
# Select top 10 openings based on TotalCount for each class and normalize counts
top_beginner <- chess_beginner_agg %>%
arrange(desc(TotalCount)) %>%
top_n(10, TotalCount) %>%
mutate(Class = 'Beginner',
NormalizedCount = TotalCount / sum(TotalCount)) # Normalizing counts
top_amateur <- chess_amateur_agg %>%
arrange(desc(TotalCount)) %>%
top_n(10, TotalCount) %>%
mutate(Class = 'Amateur',
NormalizedCount = TotalCount / sum(TotalCount)) # Normalizing counts
top_pro <- chess_pro_agg %>%
arrange(desc(TotalCount)) %>%
top_n(10, TotalCount) %>%
mutate(Class = 'Pro',
NormalizedCount = TotalCount / sum(TotalCount)) # Normalizing counts
# Combine into a single dataframe
combined_top_openings <- bind_rows(top_beginner, top_amateur, top_pro)
# Ensure each class has an entry for each opening by filling in missing combinations with zero counts
combined_top_openings <- combined_top_openings %>%
complete(Opening, Class, fill = list(TotalCount = 0, NormalizedCount = 0)) %>%
arrange(Opening, Class)
# Plot
ggplot(combined_top_openings, aes(x = NormalizedCount, y = Opening, fill = Class)) +
geom_bar(stat = "identity", position = "dodge", color = "white") +
scale_fill_brewer(palette = "Set1") +
labs(title = "Top 10 Chess Openings by Player Class",
subtitle = "What player class are you?",
x = "Normalized Count") +
theme_minimal() +
theme(axis.line = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_text(hjust = 1),
plot.title = element_text(hjust = 0, size = 16, face = "bold"),
plot.subtitle = element_text(hjust = 0, size = 12),
legend.title = element_blank())
It can be concluded that within all player classes, the Sicilian Defence is a highly popular chess opening. Thus, taking a cue from the pros, you might want to learn these three openings, the Sicilian Defence, the French Defence and the Caro-Kann Defence, for a higher probability of winning.
This section explores the hypothesis that moving first really gives White a significant advantage in winning.
# Convert the 'Result' column into a more descriptive outcome
chess<- chess%>%
filter(Result %in% c("1-0", "0-1", "1/2-1/2")) %>%
mutate(Outcome = case_when(
Result == "1-0" ~ "White Win",
Result == "0-1" ~ "Black Win",
Result == "1/2-1/2" ~ "Draw"
))
# Count the number of each outcome type
outcome_counts <- chess%>%
group_by(Outcome) %>%
summarise(Count = n())
# Plot
ggplot(outcome_counts, aes(x = Outcome, y = Count, fill = Outcome)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(title = "Chess Game Outcomes by Starting Position",
x = "Outcome",
y = "Count",
fill = "Outcome") +
geom_text(aes(label = Count), vjust = -0.3)
Visually, it is hard to tell the difference between the Black and White win counts.
# Calculate the percentages
outcome_counts <- chess%>%
filter(Result %in% c("1-0", "0-1", "1/2-1/2")) %>%
mutate(Outcome = case_when(
Result == "1-0" ~ "White Win",
Result == "0-1" ~ "Black Win",
Result == "1/2-1/2" ~ "Draw"
)) %>%
group_by(Outcome) %>%
summarise(Count = n()) %>%
mutate(Percentage = (Count / sum(Count)) * 100)
# Plot
ggplot(outcome_counts, aes(x = Outcome, y = Percentage, fill = Outcome)) +
geom_bar(stat = "identity", show.legend = FALSE) +
geom_text(aes(label = sprintf("%.2f%%", Percentage)), vjust = -0.5, size = 3.5) +
scale_fill_manual(values = c("White Win" = "#1f77b4", "Black Win" = "#ff7f0e", "Draw" = "#2ca02c")) +
theme_minimal() +
labs(title = "Percentage of Chess Game Outcomes by Starting Position",
x = "Outcome",
y = "Percentage")
Visually, it is still hard to see that White has a higher probability of winning than Black.
outcome_counts <- outcome_counts %>%
filter(Outcome != "Draw")
ggplot(outcome_counts, aes(x = Outcome, y = Percentage, fill = Outcome)) +
geom_bar(stat = "identity", position = "dodge") +
coord_flip() + # Make the bar chart horizontal
scale_fill_manual(values = c("Black Win" = "black", "White Win" = "white")) +
labs(title = "Does the first turn matter?",
subtitle = "An analysis of 130,922 game outcomes",
x = "",
y = "Percentage (%)") +
theme_minimal() +
theme(axis.title.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.title = element_blank(),
plot.title = element_text(hjust = 0, size = 16, face = "bold"),
plot.subtitle = element_text(hjust = 0, size = 12),
plot.background = element_rect(fill = "#f0f0f0"),
panel.background = element_rect(fill = "#f0f0f0", color = NA))
Based on the analysis of the game outcomes alone, it can be deduced that playing as White carries a small advantage in winning probability, probably due to the first-move advantage.
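This eyeballed advantage can be checked formally with a binomial test on the decisive games: under the null hypothesis that colour does not matter, White should win half of them. The counts below are illustrative placeholders, not the real figures; substitute the values from `outcome_counts` above.

```r
# Illustrative counts only -- substitute the real White/Black win
# counts from outcome_counts
white_wins <- 52000
black_wins <- 48000

# Exact binomial test: does White's share of decisive games differ from 50%?
test <- binom.test(white_wins, white_wins + black_wins, p = 0.5)
test$p.value
```

A very small p-value would confirm that even a modest percentage gap, over this many games, is unlikely to arise by chance.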
Now that we have established that playing as White gives a slight winning advantage, I want to know which chess openings to use when playing as White.
We map the ECO codes to opening names, since the existing opening column stores the names in inconsistent formats.
# Function to map ECO codes to opening names
map_eco_to_opening <- function(ECO_code) {
opening <- eco_ranges$Opening[which(ECO_code >= eco_ranges$Start & ECO_code <= eco_ranges$End)]
if(length(opening) > 0) {
return(opening[1])
} else {
return(NA)
}
}
# Apply the function to create 'opening_new' column
chess$opening_new <- sapply(chess$ECO, map_eco_to_opening)
We filter for the games where White wins.
# Filter
opening_win <- chess %>%
filter(Result == '1-0')
head(opening_win)
#Count
opening_counts <- opening_win %>%
group_by(opening_new) %>%
summarise(Count = n()) %>%
arrange(desc(Count))
# Plot
ggplot(opening_counts, aes(x = reorder(opening_new, Count), y = Count)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Count of Chess Openings in Wins for White",
x = "Opening",
y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#### Observation
We can afford to remove the bottom few openings, since they are rarely used.
# Calculate the 25th percentile of the counts
threshold <- quantile(opening_counts$Count, 0.25)
# Filter out the openings in the bottom 25%
filtered_opening_counts <- opening_counts %>%
filter(Count > threshold)
ggplot(filtered_opening_counts, aes(x = reorder(opening_new, Count), y = Count)) +
geom_bar(stat = "identity", fill = "darkgrey") +
coord_flip() +
labs(title = "Winning Chess Openings for White",
subtitle = "Winning is Everything",
x = "Opening",
y = "") +
theme_minimal() +
theme(axis.title.y = element_blank(),
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 10),
axis.text.x = element_text(hjust = 1))
Therefore, to get the highest probability of winning, I will strive to play as White in games that enter the Sicilian Defence. (Strictly, the Sicilian is the most frequent opening among White wins, which partly reflects how often it is played overall.)
Out of curiosity, I also plotted the same for Black.
# Filter
opening_win_b <- chess %>%
filter(Result == '0-1')
#Count
opening_count_b <- opening_win_b %>%
group_by(opening_new) %>%
summarise(Count = n()) %>%
arrange(desc(Count))
# Calculate the 25th percentile of the counts
threshold <- quantile(opening_count_b$Count, 0.25)
# Filter out the openings in the bottom 25%
filtered_opening_count_b <- opening_count_b %>%
filter(Count > threshold)
ggplot(filtered_opening_count_b, aes(x = reorder(opening_new, Count), y = Count)) +
geom_bar(stat = "identity", fill = "black") +
coord_flip() +
labs(title = "Winning Chess Openings for Black",
subtitle = "Winning is Everything",
x = "Opening",
y = "") +
theme_minimal() +
theme(axis.title.y = element_blank(),
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 10),
axis.text.x = element_text(hjust = 1))
It looks like the Sicilian Defence is the most common opening among wins for both colours.
Note that several different types of chess games were played, and different game types might produce different playing styles and decisions, which could affect the moves and openings used.
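One way to probe this would be to split the games by type before comparing outcomes. The sketch below assumes the dataset carries an `Event` column describing the game type, as lichess exports typically do; the toy rows are purely illustrative.

```r
library(dplyr)

# Toy rows standing in for `chess`; the Event column is an assumption
toy <- data.frame(
  Event  = c("Rated Blitz game", "Rated Bullet game",
             "Rated Blitz game", "Rated Bullet game"),
  Result = c("1-0", "0-1", "1-0", "1-0")
)

# Share of White wins within each game type
white_by_type <- toy %>%
  group_by(Event) %>%
  summarise(white_win_rate = mean(Result == "1-0"))

white_by_type
```

A large spread in `white_win_rate` across game types would suggest that time control, not just the first move, shapes outcomes.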
A player interaction analysis would also be interesting…
I acknowledge that the code could be more efficient in terms of run time and unnecessary variables, but it's 3am and I'm really tired; I promise assignment 2's code will be more efficient.
I hope you enjoyed the insights from the chess dataset, because I did. I shall now endeavour to start my matches as White and open with the Sicilian Defence.